The first plot shows all the production indicators from both the current studies and the original framework in the y-axis. Orange indicates that the indicator is only being used in the current studies, purple that it is only included in the Wiltshire framework, and green that the indicator is used in both the framework and current studies.
The x-axis shows the number of secondary data metrics that have been collected to represent those indicators. You can see that there are some indicators for which there exist many data, but many indicators for which I have found little to represent them.
Value-added market indicators are pulled from various NASS, as are the total quantity of food and forest products and production inputs. There is plenty more that might be pulled from NASS here. Imports and exports are from the Economic Research Service. The exports data are far more detailed than the imports. The former are disaggregated by category at the state level (fresh fruit, processed fruit, dairy…) which is why there are a heap of metrics for it. The import data is weak - I could only find the value of the top five agricultural imports for each state, not a total. Recalls are from FDA records, but I have not any helpful information the impact of recalls in terms of food safety. Crop diversity is represented in the richness indicator by the Cropland CROS data set, which provides estimates of the area of farmland devoted to specific crops across the US. I have disaggregated these at the county and state levels here.
You can see there is plenty more in the frameworks that are not represented by secondary data here, particularly related to the consumer side - marketability, nutrition, food waste, and safety. I suspect some of these indicators will migrate toward other dimensions in the refinement process as well. But this does help identify some gaps in the data.
Code
pacman::p_load( dplyr, ggplot2, stringr, plotly, RColorBrewer)# Load production tree with use notesprod_tree <-read.csv('data/trees/prod_tree_with_use.csv')# Counts of secondary data metricscounts <- meta %>%group_by(Indicator) %>% dplyr::summarize(count =n())# Join to Wiltshire frameworkcolors <- RColorBrewer::brewer.pal(n =3, name ='Dark2')dat <-full_join(prod_tree, counts, by =join_by(Indicator == Indicator)) %>%arrange(Indicator) %>%mutate(count =ifelse(is.na(count), 0, count),label_color =case_when( Use =='both'~ colors[1], Use =='wiltshire'~ colors[3], Use =='current'~ colors[2] ) )# [1] "#1B9E77" "#D95F02" "#7570B3"# Plotdat %>%ggplot(aes(x = Indicator, y = count)) +geom_col(color ='black',fill ='grey' ) +geom_point(data = dat,aes(x =1, y =1, color = Use),inherit.aes =FALSE,alpha =0,size =-1 ) +scale_color_manual(name ="Indicator Use:",values =c("both"= colors[1],"wiltshire"= colors[2],"current"= colors[3] ),labels =c('Both','Wiltshire Only','Current Only' ) ) +theme_classic() +theme(axis.text =element_text(size =12),axis.text.y =element_text(color =rev(dat$label_color)),axis.title =element_text(size =14),legend.text =element_text(size =12),legend.title =element_text(size =12),legend.position ="bottom",plot.margin =margin(t =10, r =75, b =10, l =10) ) +guides(color =guide_legend(override.aes =list(size =4, alpha =1)) ) +coord_flip() +labs(y ='Secondary Data Count')
Bar Plot of Indicators
Otherwise, I won’t be diving into the usual PCA exploration for the production dataset because we have collected enough metrics to put together a mostly full, mostly coherent example framework with which we can try aggregating data. This should be coming in January.
1 Crop Diversity
I wanted to highlight this cropland data layer from USDA NASS in collaboration with USGS, NRCS, and FSA, among other agencies. It’s a crop-specific LULC layer derived from satellite imagery and ground-truthing. It seems to be about the best thrust at crop diversity across regions that I’ve found, but it also is certainly tailored toward primary crops, and may not represent New England very well. I’d love to hear thoughts on how useful this would be in New England.
Shannon diversity for crop production at county level.
Similarly, we could pull crop richness out of this dataset, but I have a feeling that the bias toward commodity crops would make that a bit more problematic.
2 Distribution Plots
2.1 By County
Note that while most of the available secondary data is at the county level, the environment dimension includes a fair amount at the state level as well. This includes greenhouse gas emissions and water quality surveys. For now, I’ll just show these separately, but some creative aggregation will have to happen eventually.
Code
pacman::p_load( dplyr, purrr, ggplot2, rlang, ggpubr, tidyr)source('dev/data_pipeline_functions.R')source('dev/filter_fips.R')metrics <-readRDS('data/sm_data.rds')[['metrics']]metadata <-readRDS('data/sm_data.rds')[['metadata']]# Use metadata to get help filter by dimensionprod_meta <- metadata %>%filter(dimension =='production')# Filter to economics dimensionprod_metrics <- metrics %>%filter(variable_name %in% prod_meta$variable_name)# env_metrics$variable_name %>% unique# get_str(env_metrics)# Filter to latest year and new (post-2024) counties# And pivot wider so it is easier to get correlationsprod_county <- prod_metrics %>%filter_fips(scope ='counties') %>%get_latest_year() %>%select(fips, variable_name, value) %>%mutate(variable_name =str_split_i(variable_name, '_', 1)) %>%pivot_wider(names_from ='variable_name',values_from ='value' ) %>%unnest(!fips) %>%mutate(across(c(2:last_col()), as.numeric))# Save temp file for use in analysis scriptsaveRDS(prod_county, 'data/temp/prod_county.rds')## Plotplots <-map(names(prod_county)[-1], \(var){if (is.character(prod_county[[var]])) { env_county %>%ggplot(aes(x =!!sym(var))) +geom_bar(fill ='lightblue',color ='royalblue',alpha =0.5 ) +theme_classic() +theme(plot.margin =unit(c(rep(0.5, 4)), 'cm')) } elseif (is.numeric(prod_county[[var]])) { prod_county %>%ggplot(aes(x =!!sym(var))) +geom_density(fill ='lightblue',color ='royalblue',alpha =0.5 ) +theme_classic() +theme(plot.margin =unit(c(rep(0.5, 4)), 'cm')) } else {return(NULL) }})# Arrange them in 4 columnsggarrange(plotlist = plots,ncol =3,nrow =4)
Distributions of production metrics at the county level.
2.2 By State
Code
pacman::p_load( dplyr, purrr, ggplot2, rlang, ggpubr, tidyr)state_codes <-readRDS('data/sm_data.rds')[['fips_key']] %>%select(fips, state_code)prod_state <- prod_metrics %>%filter_fips(scope ='state') %>%get_latest_year() %>%select(fips, variable_name, value) %>%mutate(variable_name =str_split_i(variable_name, '_', 1)) %>%pivot_wider(names_from ='variable_name',values_from ='value' ) %>%unnest(!fips) %>%mutate(across(c(2:last_col()), as.numeric)) %>%left_join(state_codes, by ='fips')# Save temp data file for use in analysis scriptsaveRDS(prod_state, 'data/temp/prod_state.rds')# Variables to map. vars <-names(prod_state)[-c(1, 43)]## Plotplots <-map(vars, \(var){ prod_state %>%ggplot(aes(y =!!sym(var), x = state_code, color = state_code)) +geom_point(alpha =0.5,size =3 ) +theme_classic() +theme(plot.margin =unit(c(rep(0.5, 4)), 'cm'),legend.position ='none' ) +labs(x ='State' )})# Arrange them in 4 columnsggarrange(plotlist = plots,ncol =4,nrow =11)
Distributions of production variables at state level
3 Bivariate Plots
Using a selection of variables at the county level.
Code
pacman::p_load( GGally)# Neat function for mapping colors to ggpairs plots# https://stackoverflow.com/questions/45873483/ggpairs-plot-with-heatmap-of-correlation-valuesmap_colors <-function(data, mapping,method ="p",use ="pairwise", ...) {# grab data x <-eval_data_col(data, mapping$x) y <-eval_data_col(data, mapping$y)# calculate correlation corr <-cor(x, y, method = method, use = use) colFn <-colorRampPalette(c("blue", "white", "red"), interpolate ='spline') fill <-colFn(100)[findInterval(corr, seq(-1, 1, length =100))]# correlation plotggally_cor(data = data, mapping = mapping, color ='black', ...) +theme_void() +theme(panel.background =element_rect(fill = fill))}lower_function <-function(data, mapping, ...) {ggplot(data = data, mapping = mapping) +geom_point(alpha =0.5) +geom_smooth(color ="blue", fill ="grey", ...) +theme_bw()}# Rename variables to be shorterprod_county %>%select(-fips) %>%ggpairs(upper =list(continuous = map_colors),lower =list(continuous = lower_function),axisLabels ='show' ) +theme(strip.text =element_text(size =5),axis.text =element_text(size =5),legend.text =element_text(size =5) )
4 Correlations
Only showing correlations by county because we don’t have enough observations to run it by state.
Code
pacman::p_load( dplyr, tidyr, tibble, stringr, purrr, tidyr, ggplot2, plotly, reshape, Hmisc, viridisLite)# get_str(env_county)cor <- prod_county %>%select(-fips) %>%as.matrix() %>%rcorr()# Melt correlation values and rename columnscor_r <-melt(cor$r) %>%setNames(c('var_1', 'var_2', 'value'))# Save p valuescor_p <-melt(cor$P)p.value <- cor_p$value# Make heatmap with custom text aesthetic for tooltipplot <- cor_r %>%ggplot(aes(var_1, var_2, fill = value, text =paste0('Var 1: ', var_1, '\n','Var 2: ', var_2, '\n','Correlation: ', format(round(value, 3), nsmall =3), '\n','P-Value: ', format(round(p.value, 3), nsmall =3) ))) +geom_tile() +scale_fill_viridis_c() +theme(axis.text.x =element_text(hjust =1, angle =45)) +labs(x =NULL,y =NULL,fill ='Correlation' )# Convert to interactive plotly figure with text tooltipggplotly( plot,tooltip ='text',width =800,height =500)